ATTED-II (http://atted.jp) is a database of coexpressed genes 
that was originally developed to identify functionally related 
genes in Arabidopsis and rice. Herein, we describe an 
updated version of ATTED-II, which expands this resource 
to include additional agriculturally important plants. To im- 
prove the quality of the coexpression data for Arabidopsis 
and rice, we included more gene expression data from 
microarray and RNA sequencing studies. The RNA sequen- 
cing-based coexpression data now cover 94% of the 
Arabidopsis protein-encoding genes, representing a substantial
increase from previously available microarray-based 
coexpression data (76% coverage). We also generated coex- 
pression data for four dicots (soybean, poplar, grape and 
alfalfa) and one monocot (maize). As both the quantity 
and quality of expression data for the non-model species 
are generally poorer than for the model species, we verified 
coexpression data associated with these new species using 
multiple methods. First, the overall performance of 
the coexpression data was evaluated using gene ontology 
annotations and the coincidence of a genomic feature. 
Secondly, the reliability of each guide gene was determined 
by comparing coexpressed gene lists between platforms. 
With the expanded and newly evaluated coexpression 
data, ATTED-II represents an important resource for iden- 
tifying functionally related genes in agriculturally important 
plants. 

Keywords: Arabidopsis • Comparative transcriptomics • 
Database • Gene coexpression • Gene network • Non- 
model species. 

Abbreviations: AUC, area under the curve; GO, gene ontol- 
ogy; MR, mutual rank; ROC, receiver operating characteristic; 
RNAseq, RNA sequencing. 



Introduction 

Recent high-throughput sequencing technologies have made it 
possible to generate genomic and transcriptomic data for non- 
model species. Annotation of these new sequences is typically 
accomplished by comparison with annotations of known 
orthologs. However, in contrast to clear orthologous relation- 
ships that characterize animal genomes, these types of relation- 
ships can be quite complicated in plants because of gene 
duplication events (Tang et al. 2008). Gene expression patterns 
can help address this problem, i.e. distinguish between paralo- 
gous genes, by providing clues concerning their biological roles. 
Genes involved in related biological pathways are generally ex- 
pressed together, and thus, information about gene coexpres- 
sion is key to understanding biological systems at the molecular 
level. Coexpression data have been used in many different ex- 
perimental designs, including gene targeting, regulatory inves- 
tigations and identifying protein-protein interactions (Aoki 
et al. 2007, Usadel et al. 2009, Obayashi and Kinoshita 2010). 

We have constructed ATTED-II, which is a database of coex- 
pressed genes for Arabidopsis (Obayashi et al. 2007), and have 
continuously improved it to increase its functionality, e.g. by 
incorporating condition-specific coexpression and the ability to 
draw networks (Obayashi et al. 2009, Obayashi et al. 2011). 
These tools can help identify functional gene relationships, so 
that reverse genetics and molecular biological techniques can 
be used to confirm predicted gene functions (Obayashi and 
Kinoshita 2010). 

A grand challenge of plant science is to take the knowledge 
gained from model species (Arabidopsis and rice, in particular) 
and apply it to non-model species, other crops and trees 
(Godfray et al. 2010). To address this issue, we have expanded 
ATTED-II to include four dicots (soybean, poplar, grape and 
Table 1 Coexpression data in ATTED-II version 7.1

Species      Version    No. of genes  Gene coverage (%)^a  No. of experiments  No. of samples^b  Platform    Release date
Arabidopsis  Ath.c5-0   20,836        76                   737                 11,171            A-AFFY-2    May 23, 2013
Arabidopsis  Ath2.c1-0  25,838        94                   28                  328               RNAseq      August 17, 2013
Soybean      Gma.c1-0   15,902        29                   31                  938               A-AFFY-59   May 23, 2013
Poplar       Ppo.c1-0   21,909        53                   23                  404               A-AFFY-131  May 23, 2013
Grape        Vvi.c1-0   8,351         32                   14                  245               A-AFFY-78   May 23, 2013
Alfalfa      Mtr.c1-0   4,166         9                    43                  585               A-AFFY-71   May 23, 2013
Rice         Osa.c3-0   20,625        53                   73                  1,214             A-AFFY-126  May 23, 2013
Maize        Zma.c1-0   8,397         -                    47                  617               A-AFFY-77   May 23, 2013

a Gene coverage indicates the percentage of protein-encoding genes (provided by Phytozome v9.1) that are included in the coexpression data set (Goodstein et al. 
2012). Statistics for maize are not provided because of poor annotation quality. 

b This column indicates the number of slides for each microarray platform and the number of runs for the RNAseq platform (Ath2). 

alfalfa) and one monocot (maize), which will facilitate the ana- 
lysis of gene coexpression in non-model species while maintain- 
ing the reliability of the original coexpression indexes. For 
Arabidopsis, we prepared RNA sequencing (RNAseq)-based 
coexpression data and refined the microarray-based data. 
Although several databases, including the previous version of 
ATTED-II, provide coexpression data for multiple plant species 
(Toufighi et al. 2005, Mutwil et al. 2008, Jupiter et al. 2009, Ogata 
et al. 2010, Hamada et al. 2011, Mutwil et al. 2011, Patel et al. 
2012, Yim et al. 2013), the quality of these data has not been fully
evaluated. Compared with the Arabidopsis data, the quality of 
coexpression data for other organisms is quite poor, primarily 
because of the limited number of microarray experiments. We 
have previously used gene ontology (GO) annotations to assess 
the accuracy of coexpression data (Obayashi and Kinoshita 
2009, Kinoshita and Obayashi 2009), but GO annotations for 
non-model species are also less accurate than are those for 
model species. Thus, this approach did not work reliably. 

To overcome this deficiency, we measured the degree of 
coincidence between coexpression data and a 'genomic fea- 
ture', e.g. codon usage. Because genomic features are available 
for every gene, the quality of this type of information is con- 
sistent between species. Codon usage is a genomic feature 
related to coexpression (Plotkin et al. 2004, Najafabadi et al. 
2009). We measured the degree of coincidence between coex- 
pression and similarities in codon usage. The overall coinci- 
dence score seems to be a good measure of the quality of the 
coexpression data. In addition to an assessment of the overall 
performance of the gene coexpression data set, we also evalu- 
ated each coexpressed gene pair. This was accomplished by 
comparing coexpressed gene lists between platforms. If coex- 
pression of two genes is conserved in two or more species, the 
reliability of that relationship is greatly enhanced, and the like- 
lihood that experimental or technical artifacts are present is 
reduced (Stuart et al. 2003, Oti et al. 2008, Movahedi et al. 2011, 
Obayashi and Kinoshita 2011). 

By filtering out less reliable gene coexpression data, the 
remaining data can be applied to non-model species with a 
greater degree of confidence. With the new coexpression data 
and added performance evaluations, the improved ATTED-II is 
a powerful database for identifying functionally related genes in 
agriculturally important plants. 

Results and Discussion 
New coexpression data for seven species 

We first updated the coexpression data sets for Arabidopsis and 
rice by downloading microarray data from ArrayExpress (Rustici 
et al. 2013), which increased the number of Arabidopsis 
(Arabidopsis thaliana) microarrays from 1,388 to 11,171 and 
the number of rice (Oryza sativa) microarrays from 130 to 
1,214. We also prepared new coexpression data sets for soybean 
(Glycine max), poplar (Populus sp.), grape (Vitis vinifera), alfalfa 
(Medicago truncatula) and maize (Zea mays). In addition to the 
microarray-based coexpression data, we acquired RNAseq- 
based coexpression data for Arabidopsis. This helped resolve 
microarray-specific problems, especially for poorly expressed 
genes. Although the number of experiments in the RNAseq 
version (Ath2.c1-0) is currently limited, we anticipate that 
this will be a short-term problem. One prominent characteristic 
of the RNAseq data is deep coverage. Almost all Arabidopsis 
genes are included (Ath2.c1-0, 94% of the protein-encoding 
genes), representing a significant advantage over the 
microarray-based coexpression data set (Ath.c5-0, 76% of the 
protein-encoding genes) (Table 1). RNAseq and microarray 
coexpression data sets can now be viewed at the same 
time (Fig. 1). 

Overall performance for gene coexpression data 

Because gene coexpression data sets can be constructed using 
many types of expression data and many types of methods, it is 
necessary to evaluate the data carefully. We previously used the 
predictive performance of GO annotations to evaluate coex- 
pression data sets (Obayashi and Kinoshita 2009, Kinoshita and 
Obayashi 2009) because coexpressed genes probably share 
functional properties. Herein, we partially modified our previ- 
ous assessment procedure to provide a simpler interpretation. 
We compared coexpression values between two sets of gene 
pairs; one pair shared at least one GO term, whereas the other 
[Figure 1: screenshot of the ATTED-II list of genes coexpressed with At1g44575. Columns: rank, locus, alias (short description), reliability, and MR values for Ath.c5-0 (At1g44575), Ath2.c1-0 (At1g44575), Gma.c1-0 (LOC100807355), Osa.c3-0 (Os01g0869800), Osa.c3-0 (Os04g0690800), Ppo.c1-0 (POPTRDRAFT_816277) and Vvi.c1-0 (LOC100245393). Only the first MR column is reproduced here.]

Rank  Locus      Alias    MR (Ath.c5-0)
0     At1g44575  PSBS     0.0
1     At1g12900  GAPA-2   3.9
2     At1g42970  GAPB     4.7
3     At3g55800  SBPASE   5.2
4     At1g67740  YCF32    11.5
5     At1g20340  PETE2    12.8
6     At4g38970  FBA2     13.4
7     At5g08050  DUF1118  14.8
8     At4g21280  PSBQA    15.7
9     At1g06680  PSII-P   15.9
10    At1g52230  PSI-H    18.0
11    At1g08380  PSAO     18.6
12    At1g15820  LHCB6    19.6

Fig. 1 An example of a coexpressed gene list in ATTED-II. The Arabidopsis PSBS gene is used as the example of a guide gene, and coexpressed 
genes are shown along with their mutual rank (MR) values (a smaller MR value indicates stronger coexpression). The six columns on the right 
indicate the degree of coexpression for ortholog pairs in other species (or another Arabidopsis platform). Coexpression with an MR value >200 is 
considered weak (gray text). A blank cell means that coexpression data were not available. The reliability was calculated on the basis of 
coexpression conservation and is represented by stars. Three stars indicate excellent reliability, whereas no stars indicate that the list is not reliable. This 
list is available at http://atted.jp/cgi-bin/coex_list.cgi?gene=At1g44575. 



pair did not. With the use of different coexpression thresholds, 
a receiver operating characteristic (ROC) curve was prepared 
for each coexpression data set. As a representative value of the 
ROC curve, we used AUC0.01 (the area under the ROC curve up 
to the point where the false-positive rate = 0.01) (McClish 1989) 
because, when using these gene coexpression data sets, researchers 
typically select highly coexpressed pairs of genes for 
further study. In particular, to draw coexpressed gene networks 
in ATTED-II, we considered only the top three connections for 
each gene. The conventional ROC AUC value, in contrast, is 
largely determined by the ordering of very weak coexpression 
(e.g. several hundredths or thousandths of the strongest coexpression), 
which is generally too weak for ordinary coexpression 
analyses. We therefore used AUC0.01 to focus on the performance 
of more strongly coexpressed genes. 
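
The partial area under the ROC curve described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build ATTED-II: the function name `partial_auc`, its arguments and the toy data are assumptions, and tied scores are not handled.

```python
def partial_auc(scores, labels, max_fpr=0.01):
    """Area under the ROC curve restricted to false-positive rates <= max_fpr.

    scores: larger value = stronger coexpression (for MR values, pass -MR).
    labels: True if the gene pair shares a GO term, otherwise False.
    Assumes no tied scores, so the ROC curve is a staircase.
    """
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    n_pos = sum(1 for lab in ranked if lab)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    area = 0.0
    pos_seen = 0  # positives ranked above the current pair
    neg_seen = 0  # negatives ranked above the current pair
    for lab in ranked:
        if lab:
            pos_seen += 1
            continue
        # Each negative pair adds one horizontal step of width 1/n_neg.
        left = neg_seen / n_neg
        right = min((neg_seen + 1) / n_neg, max_fpr)
        if right <= left:
            break  # past the false-positive cutoff
        area += (right - left) * (pos_seen / n_pos)
        neg_seen += 1
    return area


# Toy example (hypothetical values): four gene pairs ranked by MR.
mr_values = [1.0, 2.0, 3.0, 4.0]
shares_go = [True, False, True, False]
print(partial_auc([-m for m in mr_values], shares_go, max_fpr=0.5))  # 0.25
```

For a random ranking the ROC curve follows the diagonal, so the expected partial area up to a false-positive rate of 0.01 is 0.01 x 0.01 / 2 = 0.5E-4, which is the value listed in the "Random" row of Table 2.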

Table 2 shows the predictive value of GO annotations for 
coexpression data presented in the current ATTED-II database. 
For comparison, predictive performance is also shown for pre- 
vious versions of Arabidopsis (Ath.c4-1) and rice (Osa.c2-0) 
coexpression data (italicized lines). The performance using 
Ath.c5-0 (7.27) is superior to that when Ath.c4-1 (5.97) is 
used and slightly better for rice when Osa.c3-0 (3.73) instead 
of Osa.c2-0 (3.63) is used (GO score in Table 2). One limitation 
of using GO terms to perform these quality assessments is 
that the assessment depends on the quality of the GO terms for 
each species. Even for the most intensely studied plants, 
Arabidopsis and rice, the number of selected GO terms asso- 
ciated with a gene can be quite different (Table 3). We there- 
fore developed an alternative quality assessment method that 
uses codon usage. Previous reports indicate that codon usage is 
related to gene function. For example, genes with similar 
expression patterns (Plotkin et al. 2004, Najafabadi et al. 2009, 
Camiolo et al. 2012) or genes that encode interacting proteins 
(Najafabadi and Salavati 2008) have similar patterns of codon 
usage, possibly owing to varying abundance of diverse tRNAs in 
different tissues. Given the results of these reports, we con- 
structed a gene similarity matrix based on codon usage. We 
then measured the degree of coincidence between the coex- 
pression data and the codon usage similarity matrix. To meas- 
ure similarity between these two gene lists, we previously 
proposed a similarity measure COXSIM that is the weighted 
concordance rate of the top 100 genes in the two lists 
(Obayashi et al. 2013). The reasoning behind this analysis is 
similar to why we used the partial AUC0.01, in that we focused 
on eliminating false positives. Table 2 shows the degree of co- 
incidence between gene coexpression and codon usage similar- 
ity. As expected, the degree of coincidence was greatest for the 
current Arabidopsis coexpression data set (Ath.c5-0). These 
Table 2 Development of coexpression data performance

Species        Version    No. of genes  No. of samples^b  GO score^c  Codon score^d
Arabidopsis    Ath.c5-0   20,836        11,171            7.27        4.02
Arabidopsis^a  Ath.c4-1   20,906        1,388             5.97        2.48
Arabidopsis    Ath2.c1-0  25,838        328               4.88        2.63
Soybean        Gma.c1-0   15,902        938               -           2.53
Poplar         Ppo.c1-0   21,909        404               -           1.77
Grape          Vvi.c1-0   8,351         245               -           1.42
Alfalfa        Mtr.c1-0   4,166         585               -           1.37
Rice           Osa.c3-0   20,625        1,214             3.73        2.38
Rice^a         Osa.c2-0   20,725        370               3.63        2.78
Maize          Zma.c1-0   8,397         617               -           1.96
Random         -          -             -                 0.5         1.00

a Italicized lines (marked a) indicate previous versions of Arabidopsis and rice coexpression data. 

b This column indicates the number of slides for each microarray platform and the number of runs for the RNAseq platform (Ath2). 

c Predictive performance of the GO annotation represented by AUC0.01 (E-4). A larger score indicates a better performance. 

d Coincidence score with codon similarity represented by the median of the normalized COXSIM value. A larger score indicates a better 
performance. 



We applied similar significance levels to ATTED-II but 
made modifications because orthologous relationships are 
more complicated in plants. Based on orthologous gene 
data released by the Plant Genome Database Japan, the 
number of orthologs associated with a particular gene is 
highly variable, ranging from 0 to about 100. This variability 
makes statistical comparisons difficult. For each gene, there- 
fore, a BLASTP search was performed (e-value <1E-5; 
Altschul et al. 1997), and the top three genes were con- 
sidered as candidate gene orthologs to be used to calculate 
the COXSIM value. The maximum COXSIM value obtained across 
the seven reference platforms (maxCOXSIM) was selected, and 
its significance was determined from the null distribution 
of the comparisons. Note that the three candidate orthologs (identified 
using BLASTP) may not include the true functional ortho- 
log, particularly in the case of a large gene family, and that 
the lack of support data does not directly mean the guide 
gene is defective. The degree of significance is indicated by 
stars on the gene list in ATTED-II. Single, double and triple 
stars correspond to P-values <1E-4, <1E-12 and <1E-30, respectively. 
Coexpression data for genes with poor reliability can 
be removed using row and column filters (Fig. 1). The 
number of genes at each significance level is shown in 
Fig. 2. In general, conservation-based reliability displays a 
similar trend to codon usage-based reliability (Table 2), although 
the number of stars depends on the availability of 
closely related species. For example, maize genes typically have fewer 
stars because closely related species lack accurate coexpression data. 
In contrast, Arabidopsis has many more three-star genes 
because coexpression comparisons were mainly performed 
using the same species (Ath and Ath2). This notwithstand- 
ing, the high coexpression values in Arabidopsis once again 
provide high confidence in the reliability of coexpression 
targets obtained, independently of the analytical platform. 
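
The mapping from significance level to reliability stars can be written as a small lookup. The thresholds below are the ones stated in the text; the function name `reliability_stars` is hypothetical, and the way the P-value itself is derived from the null distribution is not shown because the text does not specify it.

```python
def reliability_stars(p_value):
    """Number of reliability stars for a guide gene, given the P-value of its maxCOXSIM."""
    if p_value < 1e-30:
        return 3  # triple star
    if p_value < 1e-12:
        return 2  # double star
    if p_value < 1e-4:
        return 1  # single star
    return 0      # no star: no supporting conservation


# Hypothetical P-values for four guide genes.
for p in (0.5, 1e-5, 1e-13, 1e-31):
    print(p, "*" * reliability_stars(p))
```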
Table 3 Number of GO BP terms and genes to validate the predictive power of the gene coexpression data

Coexpression data  No. of GO BP terms  No. of assessed genes
Ath.c5-0           2,785               3,410
Ath.c4-1^a         2,950               3,673
Ath2.c1-0          2,950               4,058
Osa.c3-0           679                 203
Osa.c2-0^a         690                 793

a Italicized lines (marked a) indicate previous versions of Arabidopsis and rice coexpression 
data. 



coincidence scores are also listed for the new species (Table 2). 
The score for soybean is the largest, whereas the alfalfa score is 
the smallest, suggesting that the alfalfa data cannot be used in 
the same manner as the Arabidopsis data. Given this result and 
the fact that alfalfa covered the smallest total number of genes, 
we did not include the alfalfa data (Mtr.c1-0) in the parallel 
view (Fig. 1). Instead, the alfalfa data are released as only a 
downloadable table to be used in combination with other 
large-scale data sets. This restriction will be removed in future 
updates. 

Performance evaluations for each guide gene 

Although the evaluation approaches described above quantify 
the reliability of each gene coexpression data set, it is also im- 
portant to assess the reliability of each guide gene. A parallel 
view of gene coexpression is one way to examine coexpression 
reliability. Analyzing multiple species can improve coexpression 
performance (Stuart et al. 2003, Oti et al. 2008), as gene coex- 
pression present in multiple species, i.e. conserved coexpres- 
sion, is more reliable. Given this logic, we previously defined 
significance levels for genes in a mammalian coexpression 
database (Obayashi et al. 2013). 
[Figure 2: horizontal stacked bar chart. X-axis: number of genes (0 to 30,000). One bar per platform, including Arabidopsis thaliana (Ath), Arabidopsis thaliana (Ath2), Glycine max (Gma), Populus sp. (Ppo), Oryza sativa (Osa) and Zea mays (Zma); the segments of each bar give the percentage of genes with three, two, one or no stars.]

Fig. 2 Number of genes associated with each reliability level. Reliability levels are represented by stars. Three stars indicate excellent reliability, 
whereas no stars indicates not reliable. The numbers within the bars indicate the percentage of each reliability category for each species. Genes 
with no stars include genes without orthologs. 



For mashup services using coexpression data 

In addition to the bulk download functions (http://atted.jp/download.shtml) 
and the API (see http://atted.jp/help/API.shtml), 
coexpressed gene pairs [mutual rank (MR) <100] in 
any species are now available through a SPARQL endpoint for the semantic web 
community, served by the Virtuoso Universal Server (http://atted.jp/sparql). 
This will promote the development of mashup 
applications with various omics data sets. In total, approximately 
50 million triples are provided, where a pair of gene IDs is used as 
the subject and a single gene ID or the coexpression strength is 
used as the object. Sample code to link coexpression data and 
UniProt data is shown on the SPARQL endpoint page. 



Materials and Methods 
Construction of gene coexpression data 

To generate microarray-based gene coexpression data, we downloaded 
GeneChip CEL files from ArrayExpress (Rustici et al. 2013). 
The MR value of the weighted Pearson's correlation coefficient 
was used as the measure of coexpression, as described (Obayashi 
and Kinoshita 2009). Orthologous gene relationships were down- 
loaded from the ortholog database in the Plant Genome 
Database Japan to construct the parallel view (Fig. 1). 

To generate RNAseq-based gene coexpression data, we 
downloaded data from the Sequence Read Archive (Kodama 
et al. 2012) at the DNAnexus site (http://sra.dnanexus.com/). 
These data were converted to FASTQ format and mapped onto 
the mRNA sequences of Arabidopsis, using Bowtie2 (Langmead 
and Salzberg 2012). Low quality data (total mapped counts 
< 5,000,000) were filtered out, leaving 328 runs that corres- 
ponded to 28 experiments. Mapped counts were summed for 
each gene model and used as the gene expression value. Genes 
with low levels of expression, i.e. their largest counts across all 
runs were <100, were omitted. After conversion to a base-2 
logarithm with a pseudo count of 1, quantile normalization was 
applied to the data of each experiment, and the average 
expression levels were subtracted for each gene. Using all experiments 
at once, Pearson's correlation coefficients for each 
gene pair were calculated, and these values were converted to 
MR values (Obayashi and Kinoshita 2009). Note that in this 
case, quantile normalization (Bullard et al. 2010) performed 
better for the GO test than did the following normalization 
methods: RPKM (Mortazavi et al. 2008), upper quartile 
(Bullard et al. 2010), TMM (Robinson and Oshlack 2010) and 
RLE (Anders and Huber 2010) (data not shown). 
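
The main steps of this procedure (run and gene filtering, log transformation, centering, Pearson's correlation and conversion to MR) can be sketched as follows. This is an illustrative outline under stated assumptions, not the production pipeline: the function names and data layout are hypothetical, quantile normalization is omitted for brevity, and MR is taken as the geometric mean of the two directional correlation ranks, following Obayashi and Kinoshita (2009).

```python
import math


def filter_counts(counts, min_run_total=5_000_000, min_gene_max=100):
    """counts: dict gene -> list of mapped read counts, one value per run.

    Drops runs whose total mapped count is below min_run_total, then drops
    genes whose largest count across the remaining runs is below min_gene_max.
    """
    n_runs = len(next(iter(counts.values())))
    totals = [sum(c[r] for c in counts.values()) for r in range(n_runs)]
    keep = [r for r in range(n_runs) if totals[r] >= min_run_total]
    kept = {g: [c[r] for r in keep] for g, c in counts.items()}
    return {g: c for g, c in kept.items() if c and max(c) >= min_gene_max}


def log2_center(values, pseudo=1):
    """Base-2 logarithm with a pseudo count, then subtraction of the gene's mean."""
    logged = [math.log2(v + pseudo) for v in values]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]


def pearson(x, y):
    """Pearson's correlation coefficient (assumes neither vector is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mutual_rank(expr):
    """expr: dict gene -> expression vector. Returns dict (gene_a, gene_b) -> MR."""
    genes = sorted(expr)
    rank = {}
    for a in genes:
        # Rank 1 is the gene most strongly correlated with a.
        others = sorted((g for g in genes if g != a),
                        key=lambda g: -pearson(expr[a], expr[g]))
        rank[a] = {g: i + 1 for i, g in enumerate(others)}
    return {(a, b): math.sqrt(rank[a][b] * rank[b][a])
            for i, a in enumerate(genes) for b in genes[i + 1:]}


# Toy expression matrix (hypothetical values, four genes x four runs).
expr = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 5], "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
print(mutual_rank(expr)[("g1", "g2")])  # 1.0
```

Because MR is computed from ranks in both directions, a pair receives a small MR only when each gene is near the top of the other's list.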

Predictive performance of GO terms by gene 
coexpression data 

Given the different importance of GO terms along with their 
hierarchical topologies, we selected GO terms for evaluating 
coexpression data as described (Kinoshita and Obayashi 2009), 
with slight modifications. We selected GO terms associated with 
1-20 genes. Genes associated with at least one selected GO term 
were used in this assessment. The number of GO Biological 
Process terms and the number of genes used for each platform 
are shown in Table 3. All gene pairs in a platform were divided 
into two groups: those that shared at least one GO term and 
those that did not. The difference in the distributions of degrees 
of coexpression was assessed using the partial ROC AUC (AUC0.01). 
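
The selection of GO terms and the division of gene pairs into two groups might look like the sketch below. The input format (a dictionary from gene ID to a set of GO term IDs) and both function names are assumptions made for illustration; the 1-20 gene window is the one given in the text.

```python
from itertools import combinations


def select_go_terms(gene_to_terms, min_genes=1, max_genes=20):
    """Keep GO terms annotated to between min_genes and max_genes genes."""
    term_to_genes = {}
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)
    return {t for t, genes in term_to_genes.items()
            if min_genes <= len(genes) <= max_genes}


def label_gene_pairs(gene_to_terms, selected_terms):
    """Return dict (gene_a, gene_b) -> True if the pair shares a selected GO term.

    Only genes annotated with at least one selected term are assessed.
    """
    kept = {g: set(ts) & selected_terms for g, ts in gene_to_terms.items()}
    kept = {g: ts for g, ts in kept.items() if ts}
    return {(a, b): bool(kept[a] & kept[b])
            for a, b in combinations(sorted(kept), 2)}


# Toy annotation (hypothetical IDs); max_genes is lowered to 3 so that the
# broad term "GO:big" is excluded in this small example.
annotation = {
    "g1": {"GO:1", "GO:big"},
    "g2": {"GO:1", "GO:big"},
    "g3": {"GO:2", "GO:big"},
    "g4": {"GO:big"},
}
selected = select_go_terms(annotation, max_genes=3)
print(label_gene_pairs(annotation, selected))
```

The resulting labels, paired with the MR value of each gene pair, are the inputs needed to draw the ROC curve for a platform.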

Coincidence score with codon similarity 

Protein-encoding sequences were retrieved from TAIR (Lamesch 
et al. 2012), RAP-DB (Sakai et al. 2013) and NCBI GenBank 
(Benson et al. 2013). For each gene, a 61-dimensional vector was 
constructed from the number of codons in the protein-encoding 
sequence. Pearson's correlation coefficients for vectors between 
all gene pairs were calculated and used to indicate codon usage 
similarity. For each guide gene, the gene list was then ordered on 
the basis of the strength of the codon usage similarity. Finally, the 
gene list was compared with the coexpressed gene list to assess 
the quality of the coexpression data. 
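
A minimal sketch of the codon usage comparison follows. It assumes the standard genetic code (61 sense codons, with TAA, TAG and TGA as stops) and in-frame coding sequences; the function names and toy sequences are hypothetical.

```python
import math
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOP_CODONS]


def codon_vector(cds):
    """61-dimensional vector of sense-codon counts for an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = dict.fromkeys(SENSE_CODONS, 0)
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skips stop codons and ambiguous bases
            counts[codon] += 1
    return [counts[c] for c in SENSE_CODONS]


def pearson(x, y):
    """Pearson's correlation coefficient (assumes neither vector is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_codon_similarity(guide, cds_by_gene):
    """Genes other than the guide, ordered from most to least similar codon usage."""
    vectors = {g: codon_vector(s) for g, s in cds_by_gene.items()}
    others = [g for g in vectors if g != guide]
    return sorted(others, key=lambda g: -pearson(vectors[guide], vectors[g]))


# Toy sequences (hypothetical): g2 reuses the codons of g1, g3 does not.
seqs = {"g1": "ATGGCTGCT", "g2": "ATGGCTGCTGCT", "g3": "ATGAAAAAA"}
print(rank_by_codon_similarity("g1", seqs))  # ['g2', 'g3']
```

The ordered list returned for each guide gene plays the role of the reference list against which the coexpressed gene list is compared.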
Similarity of gene lists 

To measure similarity between two gene lists, we used 
the COXSIM value (Obayashi et al. 2013), which is an asymmetric 
modification of the ordered-list similarity proposed by Yang 
et al. (2006), designed to handle multiple gene matches between two 
lists of genes. 

$$\mathrm{COXSIM}_k(\mathit{list}, \mathit{ref\_list}) = \sum_{i=1}^{k} n(i, \mathit{list}, \mathit{ref\_list}) \Big/ \sum_{i=1}^{k} i$$

where n(i, list, ref_list) is the number of genes in the top i genes of 
list that have a corresponding gene in the top i genes of ref_list. 
Note that we did not count the number of gene pairs between 
list and ref_list but the number of genes in list. Focusing on one 
list makes it possible to compare gene lists that include multiple 
gene matches. For assessment of a coexpressed gene list, we set 
k to 100, which means that we checked gene correspondence 
for the top 100 coexpressed genes, a reasonable limit when 
designing a biological experiment (Obayashi and Kinoshita 
2010). To use this measure to evaluate a guide gene, we pre- 
pared a series of COXSIM values between the guide gene of 
interest and those in other reference platforms. Genes from 
other reference platforms included the same guide gene in 
the same species and orthologous guide genes in other species. 
As the representative COXSIM value of the target guide gene, 
we used the maximal COXSIM value (maxCOXSIM). This mini- 
mized effects of unreliable gene expression data and inaccurate 
gene ortholog predictions. 

$$\mathrm{maxCOXSIM}(\mathit{list}) = \max_{\mathit{ref\_list}} \mathrm{COXSIM}_k(\mathit{list}, \mathit{ref\_list})$$
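
The COXSIM and maxCOXSIM definitions can be illustrated with a short Python sketch. It assumes that the sums run over i = 1 to k and that gene correspondence is supplied as an optional ortholog mapping; the function names, argument layout and toy lists are hypothetical.

```python
def coxsim(gene_list, ref_list, k=100, orthologs=None):
    """Weighted concordance of the top-k genes of gene_list against ref_list.

    orthologs: dict mapping a gene in gene_list to the set of corresponding
    genes in ref_list. None means both lists use the same gene IDs.
    """
    total = 0
    for i in range(1, k + 1):
        ref_top = set(ref_list[:i])
        for gene in gene_list[:i]:
            partners = {gene} if orthologs is None else orthologs.get(gene, set())
            # Count the gene once, however many partners it has in ref_top.
            if partners & ref_top:
                total += 1
    return total / (k * (k + 1) // 2)


def max_coxsim(gene_list, references, k=100):
    """Largest COXSIM over several (ref_list, orthologs) reference platforms."""
    return max(coxsim(gene_list, ref, k, orth) for ref, orth in references)


# Toy lists (hypothetical gene IDs), using k=3 instead of 100.
print(coxsim(["a", "b", "c"], ["a", "b", "c"], k=3))  # 1.0
print(coxsim(["x1", "x2"], ["a", "b"], k=2,
             orthologs={"x1": {"a", "b"}, "x2": {"c"}}))
```

Counting genes in `gene_list` rather than matched pairs is what keeps the score bounded by 1 when one gene has several candidate orthologs in the reference list.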

Because the expected value of maxCOXSIM depends on the 
total number of genes in the list, for the interspecies compari- 
son in Table 2, the maxCOXSIM value was divided by its 
expected value. The significance of the maxCOXSIM value 
was also assessed using the null distribution for each platform. 
The degree of significance is represented by stars on the gene 
list in ATTED-II, where single, double and triple stars correspond 
to P-values <1E-4, <1E-12 and <1E-30, respectively. 